Employee Turnover Prediction

Anil Dwarakanath, Sr Cloud Solution Architect - Data and AI, Partner Success Team

Recap of the customer scenario

  • Customer is a sportswear company
  • Having difficulty controlling their employee turnover rates
  • Customer would like to understand who is likely to leave the company and why
  • They want to identify activities they can engage in to start decreasing the volume of employees that leave
  • Business leaders would like to understand causality, if at all possible, especially which variables they should be looking at and whether any additional variables would be worthwhile to collect for future attempts

Considerations

  • Explainability: Explain local feature effects for every observation of the holdout dataset.
  • Bias: It is of critical importance that this model does not bias against any protected classes. No bias beyond 5% outcome disparity is acceptable for any protected class.

Data Set

  • Customer has collected data on their employees and would like to use it to predict which employees are at risk of leaving and why.

Approach

To address the customer questions above, we will take the data science approach outlined below.

  • Explore the data using visualization techniques to understand how the various features are related to each other. We can use Python matplotlib or Power BI; here we will be using the Power BI Python integration to plot various visuals quickly.
  • We will use the Principal Component Analysis (PCA) technique and create PCA plots in Power BI to understand the key features that impact the turnover rate.
  • We will use machine learning techniques to build a prediction model based on these key features.
  • We will identify the protected classes and use Fairlearn to mitigate bias. Fairlearn has built-in capabilities to detect bias in protected classes using algorithms that support a set of constraints on the predictor's behavior, called parity constraints or criteria.
  • We will use the Azure Machine Learning interpretation libraries to explain the model using a visual dashboard.

Assumptions

  • Since the dataset does not explain what features like LinkedIn Skill Code represent, we will assume that the skill code maps to certain LinkedIn activity such as exploring job opportunities, learning new skills, making new connections, etc.
  • We will assume that the protected classes that can cause bias in the prediction model are limited to the 'Race (Code)' and 'Years of Service' features.

Data Exploration and Visualization

To proceed further, we will explore the various features and insights in the dataset.

Setup

We will use Azure Machine Learning Service and create Jupyter notebooks to build the model. AML notebooks provide a convenient way to create and manage notebooks, and help to scale out if more processing power is required.

We will first import the various dependencies that will be required.

In [2]:
import logging
from matplotlib import pyplot as plt
import pandas as pd
import os

import azureml.core
from azureml.core.workspace import Workspace
from azureml.core.dataset import Dataset
print("This notebook was created using version 1.17.0 of the Azure ML SDK")
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")
ws = Workspace.from_config()
output = {}
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location

pd.set_option('display.max_colwidth', None)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T
This notebook was created using version 1.17.0 of the Azure ML SDK
You are currently using version 1.17.0 of the Azure ML SDK
Out[2]:
Subscription ID 3e0e14b3-7e28-4da7-97de-0f5cb324f030
Workspace ml-service
Resource Group ml
Location westus2

Data

We will upload the dataset to Azure Blob Storage and access it using the AML Dataset interface. We will also create training and validation datasets. The prediction label is EmployeeLeft, which will be used down the line.

In [5]:
data = "https://anildwablobstorage.blob.core.windows.net/public/EmployeeTurnoverDataset.csv"
dataset = Dataset.Tabular.from_delimited_files(data)
training_data, validation_data = dataset.random_split(percentage=0.8, seed=223)
label_column_name = 'EmployeeLeft'
training_data.to_pandas_dataframe()
Out[5]:
Current LinkedIn Activity Email Domain EmployeeLeft LinkedIn Skill Code Mentor Program Involvement (Scaled) Negative Review in Past 5 Years Race (code) Recruiting Location Code Recruiting Method Code Survey, Relative, Attitude toward Manager ... Survey, Relative, Attitude toward Work/Life Balance Survey, Relative, Attitude toward Workload Survey, Relative, Peer's Average Attitude toward Environment Survey, Relative, Peer's Average Attitude toward Manager Survey, Relative, Peer's Average Attitude toward Peers Survey, Relative, Peer's Average Attitude toward Resources Survey, Relative, Peer's Average Attitude toward Workload Survey, Relative, Peer's Average Review of Employee Weekly Hours Worked Years of Service
0 2 timesonline.co.uk 1.0 43747 2 1 0 43747 43747 2 ... 1 6 1 6 8 3 5 1 42 2
1 6 nih.gov 1.0 21249 4 0 3 10000 43747 2 ... 2 2 3 4 4 1 3 4 42 6
2 3 mysql.com 1.0 10000 3 1 1 54996 43747 1 ... 1 2 3 6 2 5 2 1 42 3
3 2 admin.ch 0.0 32498 3 1 0 88743 77494 1 ... 2 7 4 5 8 2 6 1 46 2
4 3 admin.ch 1.0 10000 2 1 2 10000 88743 2 ... 1 6 3 2 1 5 5 1 34 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11207 5 admin.ch 1.0 10000 1 1 2 10000 43747 4 ... 1 3 3 2 6 4 6 1 50 1
11208 6 mysql.com 1.0 32498 3 1 2 66245 66245 1 ... 1 4 5 6 6 5 5 2 42 3
11209 0 timesonline.co.uk 0.0 10000 4 0 0 43747 88743 4 ... 2 2 5 1 7 4 4 5 42 2
11210 2 ucoz.com 0.0 10000 1 1 0 77494 10000 4 ... 2 4 5 4 8 4 1 2 46 3
11211 6 fastcompany.com 1.0 21249 1 0 0 10000 10000 1 ... 2 3 2 6 7 6 1 4 38 6

11212 rows × 22 columns

Data Visualization in Power BI

We will use the Python integration built into Power BI to perform principal component analysis. We will create a scree plot from the PCA variance ratios and compare the principal components in Power BI. We also convert categorical features to numeric as part of feature engineering.

In [7]:
import pandas as pd 
import numpy as np
from sklearn.decomposition import PCA
from sklearn import preprocessing
import matplotlib.pyplot as plt

df = dataset.to_pandas_dataframe()
df['Email Domain'] = df['Email Domain'].astype('category')
df['Email Domain Category'] = df['Email Domain'].cat.codes
df.drop('Email Domain', axis=1, inplace=True)
df['Recruiting Location Code'] = df['Recruiting Location Code'].astype('category')
df['Recruiting Location Code Category'] = df['Recruiting Location Code'].cat.codes
df.drop('Recruiting Location Code', axis=1, inplace=True)

df['Recruiting Method Code'] = df['Recruiting Method Code'].astype('category')
df['Recruiting Method Code Category'] = df['Recruiting Method Code'].cat.codes
df.drop('Recruiting Method Code', axis=1, inplace=True)

df['LinkedIn Skill Code'] = df['LinkedIn Skill Code'].astype('category')
df['LinkedIn Skill Code Category'] = df['LinkedIn Skill Code'].cat.codes
df.drop('LinkedIn Skill Code', axis=1, inplace=True)

#print(df.head())
#print(df.shape)
scaled_data = preprocessing.scale(df.T)  # transpose so PCA treats each feature as an observation
pca = PCA() # create a PCA object
pca.fit(scaled_data) # do the math
pca_data = pca.transform(scaled_data) # get PCA coordinates for scaled_data

per_var = np.round(pca.explained_variance_ratio_* 100, decimals=1)
labels = ['PC' + str(x) for x in range(1, len(per_var)+1)]

pca_variance_df = pd.DataFrame(per_var, index=labels,columns=['variance_ratio'])
pca_variance_df['PCA'] = pca_variance_df.index

pca_df = pd.DataFrame(pca_data, index=[df.columns], columns=labels)
pca_df['columns'] = pca_df.index
pca_df
Out[7]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 ... PC14 PC15 PC16 PC17 PC18 PC19 PC20 PC21 PC22 columns
Current LinkedIn Activity -22.299782 -1.059540 -3.987979 -0.093703 -4.897862 -18.656205 14.317588 4.169611 6.030546 3.330479 ... 0.866181 -0.026295 -2.464519 0.612245 0.263668 -0.181580 -0.637434 -0.573214 2.772435e-14 (Current LinkedIn Activity,)
EmployeeLeft -55.601145 -1.584262 -2.563472 -1.312997 0.527003 -1.979724 0.923500 0.179224 0.518388 0.030798 ... 0.141991 -0.814431 2.877060 -2.345306 -0.810814 1.370649 3.945718 5.461348 -3.679002e-14 (EmployeeLeft,)
Mentor Program Involvement (Scaled) -30.315634 -0.443958 -2.652564 -0.288078 -1.139625 -0.953429 -0.946130 -0.405789 -1.583333 -3.257107 ... -11.373460 -7.148135 -6.484337 1.711870 0.238102 -0.147576 -0.419375 -0.074381 -1.765775e-14 (Mentor Program Involvement (Scaled),)
Negative Review in Past 5 Years -58.960232 -0.440707 -1.814097 -0.429896 0.094063 -0.232399 -0.484180 -0.142863 0.035233 -0.347851 ... -0.531946 -0.117237 4.162687 -3.122993 -1.191291 1.921430 7.301410 -3.396169 -4.600834e-14 (Negative Review in Past 5 Years,)
Race (code) -40.730311 -1.963489 -3.038638 -16.286199 21.107665 0.640899 1.522134 0.317898 1.132520 0.643063 ... 0.371264 0.025897 -2.328261 1.389301 0.312219 -0.358681 -0.523578 -0.336433 -3.514073e-14 (Race (code),)
Survey, Relative, Attitude toward Manager -36.898928 -0.515415 -1.712331 -0.367434 -0.547827 -0.023928 -0.496420 -0.451394 -0.383358 -0.670431 ... -0.570624 -0.278897 6.861466 -5.326427 11.519197 -3.596667 -1.502102 -0.195313 -2.507716e-14 (Survey, Relative, Attitude toward Manager,)
Survey, Relative, Attitude toward Peers -8.892259 0.085300 -1.701297 0.712798 -3.350345 -1.236725 0.291675 -1.319566 -5.691990 -13.331766 ... 2.236848 -1.333275 -2.330300 -0.078787 0.009876 0.147312 -0.090437 -0.033567 -9.905271e-16 (Survey, Relative, Attitude toward Peers,)
Survey, Relative, Attitude toward Resources -14.870893 0.111914 -1.907847 -0.556779 -1.779359 0.285952 -2.235756 -0.851921 -4.643222 0.805697 ... 18.390535 -3.866306 -3.746739 -0.027129 -0.015174 0.074738 -0.440187 -0.191694 -6.061124e-15 (Survey, Relative, Attitude toward Resources,)
Survey, Relative, Attitude toward Work/Life Balance -38.231628 -0.491518 -1.777572 -0.264552 -0.424576 -0.128785 -0.409256 -0.296054 -0.457916 -0.527616 ... -0.710047 0.363211 5.637101 -4.065266 -8.287311 -9.064612 -1.852577 -0.197132 -2.808170e-14 (Survey, Relative, Attitude toward Work/Life Balance,)
Survey, Relative, Attitude toward Workload -6.277387 0.642923 -1.423051 -0.188424 -2.045745 -0.833494 -2.482984 -1.501169 -7.351025 -4.433723 ... -3.733277 -2.912249 -1.978140 -0.244723 0.037310 0.055764 -0.084816 -0.033664 2.539635e-15 (Survey, Relative, Attitude toward Workload,)
Survey, Relative, Peer's Average Attitude toward Environment -9.437858 0.219486 -1.006988 0.403466 -1.316856 -0.785088 -0.727439 -0.322531 -2.126032 -0.420524 ... 0.290272 0.407322 9.075557 14.811969 0.311640 0.000444 -0.176858 -0.027328 1.587272e-16 (Survey, Relative, Peer's Average Attitude toward Environment,)
Survey, Relative, Peer's Average Attitude toward Manager -9.466721 0.270568 -1.501329 -0.637987 -2.019630 0.094653 -0.920314 -0.425108 -3.889580 -1.398443 ... -1.202866 19.643123 -3.894681 0.123122 0.361316 0.203099 -0.066996 0.126671 -2.706169e-16 (Survey, Relative, Peer's Average Attitude toward Manager,)
Survey, Relative, Peer's Average Attitude toward Peers 0.916232 28.197189 18.002269 -1.095059 0.786701 -0.125490 1.159813 0.024163 2.050458 0.339954 ... -0.113833 -0.393292 -0.325927 -0.335139 -0.028378 0.037655 0.081460 0.165872 9.346690e-15 (Survey, Relative, Peer's Average Attitude toward Peers,)
Survey, Relative, Peer's Average Attitude toward Resources -9.817708 0.016140 -1.459684 -0.306578 -2.512743 -0.529404 -2.705876 -0.518232 -5.162175 -2.663716 ... -1.106703 -1.233775 -2.562713 0.247425 0.131524 0.014351 -0.318572 -0.148478 -1.752071e-15 (Survey, Relative, Peer's Average Attitude toward Resources,)
Survey, Relative, Peer's Average Attitude toward Workload -8.199086 0.321881 -1.747025 0.714748 -2.803110 0.948212 -2.952439 -1.419574 -10.594341 18.193355 ... -2.686140 -1.432678 -1.924054 -0.290139 0.085632 0.110536 -0.135602 -0.025441 -2.133710e-16 (Survey, Relative, Peer's Average Attitude toward Workload,)
Survey, Relative, Peer's Average Review of Employee -26.975373 -0.229269 -2.099007 23.510994 14.166235 0.998003 1.159226 -0.885652 2.269839 0.334818 ... 0.456640 0.321011 -1.618125 0.368301 0.167322 -0.063629 -0.191334 -0.006287 -1.479979e-14 (Survey, Relative, Peer's Average Review of Employee,)
Weekly Hours Worked 526.810115 -1.205608 -1.670108 -0.504945 1.056184 0.041034 0.106400 -0.001772 1.182255 -0.014543 ... -0.343442 -0.373715 1.140406 -1.134924 -0.262192 0.387313 0.350282 0.055070 3.111122e-13 (Weekly Hours Worked,)
Years of Service -25.043385 -0.342544 -2.849185 0.121013 -3.497406 4.895632 -9.388475 20.991192 7.698957 0.713909 ... 0.243790 -0.212804 -1.883513 0.666422 0.175976 -0.125681 -0.223298 -0.029217 -1.385177e-14 (Years of Service,)
Email Domain Category -30.633657 -0.253983 -4.434714 -1.541203 -4.850957 -1.091943 -11.177732 -14.373726 15.205978 1.975126 ... 0.386670 0.208795 -2.562002 1.023046 0.132204 -0.108120 -0.279252 -0.067464 -2.364775e-14 (Email Domain Category,)
Recruiting Location Code Category -13.761499 -19.838955 26.813657 0.038056 -0.748369 0.064875 0.203107 -0.364925 1.345621 0.364499 ... 0.018780 -0.160858 -0.828382 0.138874 0.035305 -0.064777 -0.042617 -0.037609 -1.585190e-14 (Recruiting Location Code Category,)
Recruiting Method Code Category -25.612620 -0.819400 -3.811192 -1.055534 -5.938384 18.863996 15.864495 -2.221466 4.357214 0.924280 ... 0.017581 -0.626111 -1.531339 0.747791 0.042983 0.086949 -0.166244 -0.035275 -1.847827e-14 (Recruiting Method Code Category,)
LinkedIn Skill Code Category -55.700240 -0.676753 -1.657848 -0.571708 0.134944 -0.256642 -0.620939 -0.180343 0.055961 -0.590257 ... -1.048212 -0.039304 6.708754 -4.869534 -3.229115 9.301082 -4.527592 -0.400293 -3.877974e-14 (LinkedIn Skill Code Category,)

22 rows × 23 columns

Principal Component Analysis

The scree plot shows that principal component PC1, which maps primarily to the Current LinkedIn Activity feature, contributes 96% of the variance, far more than the other principal components.
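The scree plot referenced here can be reproduced directly with matplotlib from the `per_var` and `labels` arrays computed earlier; a minimal sketch with illustrative variance ratios (not the actual notebook values):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Illustrative variance ratios; in the notebook, per_var and labels
# come from pca.explained_variance_ratio_ as computed above
per_var = np.array([96.0, 1.5, 1.0, 0.8, 0.7])
labels = ['PC' + str(x) for x in range(1, len(per_var) + 1)]

# Bar chart of explained variance per principal component
plt.bar(range(1, len(per_var) + 1), height=per_var, tick_label=labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.title('Scree Plot')
plt.savefig('scree_plot.png')
```

In Power BI, the same bar chart can be built from the pca_variance_df table produced above.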

Now if we visualize 'Current LinkedIn Activity' against 'EmployeeLeft' and slice by count, we get the following. This confirms that 'Current LinkedIn Activity' is related to 'EmployeeLeft', which could potentially indicate causality for the employee turnover.

Now if we plot a scatter diagram of 'Current LinkedIn Activity' against 'EmployeeLeft', we can see the following.

The diagram shows a non-linear relationship between 'Current LinkedIn Activity' and 'EmployeeLeft', so we could use the Logistic Regression algorithm to build the prediction model.

We will proceed with building a machine learning model with Logistic Regression.
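As a quick illustration of how logistic regression turns a single feature into a leave/stay probability, here is a minimal sketch on toy data (the values below are made up and are not the customer dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D feature standing in for 'Current LinkedIn Activity';
# higher activity is loosely associated with leaving (label 1)
X_toy = np.array([[0], [1], [2], [3], [4], [5], [6]])
y_toy = np.array([0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression(solver='liblinear', fit_intercept=True)
clf.fit(X_toy, y_toy)

# predict_proba applies the sigmoid to the linear score, giving P(leave);
# the probability rises smoothly with the activity level
probs = clf.predict_proba(X_toy)[:, 1]
```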

Building the prediction model using Logistic Regression

We will use scikit-learn to build the logistic regression model.

Initialize dataset for Logistic Regression

In [38]:
# Initialize dataset
X_raw = dataset.to_pandas_dataframe()
X_raw.drop('EmployeeLeft', axis=1, inplace=True)
Y = dataset.to_pandas_dataframe()['EmployeeLeft']
X_raw
Out[38]:
Current LinkedIn Activity Email Domain LinkedIn Skill Code Mentor Program Involvement (Scaled) Negative Review in Past 5 Years Race (code) Recruiting Location Code Recruiting Method Code Survey, Relative, Attitude toward Manager Survey, Relative, Attitude toward Peers ... Survey, Relative, Attitude toward Work/Life Balance Survey, Relative, Attitude toward Workload Survey, Relative, Peer's Average Attitude toward Environment Survey, Relative, Peer's Average Attitude toward Manager Survey, Relative, Peer's Average Attitude toward Peers Survey, Relative, Peer's Average Attitude toward Resources Survey, Relative, Peer's Average Attitude toward Workload Survey, Relative, Peer's Average Review of Employee Weekly Hours Worked Years of Service
0 2 timesonline.co.uk 43747 2 1 0 43747 43747 2 4 ... 1 6 1 6 8 3 5 1 42 2
1 6 nih.gov 21249 4 0 3 10000 43747 2 4 ... 2 2 3 4 4 1 3 4 42 6
2 3 mysql.com 10000 3 1 1 54996 43747 1 3 ... 1 2 3 6 2 5 2 1 42 3
3 1 mysql.com 10000 3 1 2 43747 54996 2 4 ... 1 3 3 3 7 7 2 1 46 5
4 2 admin.ch 32498 3 1 0 88743 77494 1 3 ... 2 7 4 5 8 2 6 1 46 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
13995 6 mysql.com 32498 3 1 2 66245 66245 1 4 ... 1 4 5 6 6 5 5 2 42 3
13996 0 timesonline.co.uk 10000 4 0 0 43747 88743 4 5 ... 2 2 5 1 7 4 4 5 42 2
13997 2 ucoz.com 10000 1 1 0 77494 10000 4 3 ... 2 4 5 4 8 4 1 2 46 3
13998 2 mysql.com 10000 5 0 0 32498 21249 2 4 ... 2 7 3 3 8 4 1 2 50 0
13999 6 fastcompany.com 21249 1 0 0 10000 10000 1 2 ... 2 3 2 6 7 6 1 4 38 6

14000 rows × 21 columns

In [39]:
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Scale the dataset
A = X_raw[['Race (code)','Years of Service']]
X_dummies = pd.get_dummies(X_raw)

sc = StandardScaler()
X_scaled = sc.fit_transform(X_dummies)
X_scaled = pd.DataFrame(X_scaled, columns=X_dummies.columns)


le = LabelEncoder()
Y = le.fit_transform(Y)

Create training and test datasets

In [40]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test, A_train, A_test = train_test_split(X_scaled, 
                                                    Y, 
                                                    A,
                                                    test_size = 0.2,
                                                    random_state=0,
                                                    stratify=Y)

# Work around indexing issue
X_train = X_train.reset_index(drop=True)
A_train = A_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
A_test = A_test.reset_index(drop=True)

Initialize model and train

In [41]:
from sklearn.linear_model import LogisticRegression
unmitigated_predictor = LogisticRegression(solver='liblinear', fit_intercept=True)
unmitigated_predictor.fit(X_train, Y_train)
Out[41]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Test the model

In [44]:
# call the predict function on the model
y_pred = unmitigated_predictor.predict(X_test)
y_pred
X_test
Out[44]:
Current LinkedIn Activity LinkedIn Skill Code Mentor Program Involvement (Scaled) Negative Review in Past 5 Years Race (code) Recruiting Location Code Recruiting Method Code Survey, Relative, Attitude toward Manager Survey, Relative, Attitude toward Peers Survey, Relative, Attitude toward Resources ... Survey, Relative, Peer's Average Attitude toward Workload Survey, Relative, Peer's Average Review of Employee Weekly Hours Worked Years of Service Email Domain_admin.ch Email Domain_fastcompany.com Email Domain_mysql.com Email Domain_nih.gov Email Domain_timesonline.co.uk Email Domain_ucoz.com
0 0.026666 1.517488 -0.973344 1.009906 -0.919007 0.639203 -0.404225 2.034064 0.093902 -0.303750 ... 0.696150 -0.336652 -1.153493 0.133105 -0.437479 -0.487018 1.850616 -0.385701 -0.328684 -0.491615
1 -0.535570 0.336563 -0.973344 -0.990191 -0.919007 0.639203 -0.404225 1.029269 -1.825265 0.372430 ... 1.328234 0.713277 1.652815 -0.442752 -0.437479 -0.487018 -0.540361 -0.385701 -0.328684 2.034113
2 0.026666 1.517488 0.400604 -0.990191 0.686446 -1.535486 0.704000 1.029269 0.093902 0.372430 ... 1.328234 -0.861617 -1.855070 0.708963 -0.437479 2.053313 -0.540361 -0.385701 -0.328684 -0.491615
3 1.713373 -0.844361 1.774552 1.009906 0.151295 0.639203 -0.404225 -0.980321 0.733625 -0.303750 ... 0.064066 -0.336652 0.249661 0.708963 -0.437479 -0.487018 1.850616 -0.385701 -0.328684 -0.491615
4 1.713373 -0.844361 0.400604 -0.990191 -0.919007 0.639203 -0.404225 -0.980321 -0.545820 -0.979929 ... 0.064066 -0.336652 1.652815 1.284820 -0.437479 -0.487018 -0.540361 -0.385701 -0.328684 2.034113
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2795 0.026666 0.336563 -0.973344 -0.990191 0.686446 0.204266 -0.404225 -0.980321 -1.185543 0.372430 ... 1.328234 -0.861617 0.249661 -0.442752 -0.437479 2.053313 -0.540361 -0.385701 -0.328684 -0.491615
2796 1.151138 -0.844361 -0.973344 -0.990191 -0.919007 1.509079 1.258112 2.034064 -1.185543 0.372430 ... 0.064066 -0.336652 0.249661 -1.018610 -0.437479 -0.487018 -0.540361 2.592683 -0.328684 -0.491615
2797 0.026666 -0.844361 0.400604 1.009906 -0.919007 -0.230672 0.704000 -0.980321 1.373347 -0.303750 ... -1.200101 -0.861617 0.249661 1.860678 -0.437479 -0.487018 -0.540361 -0.385701 -0.328684 2.034113
2798 -1.660041 0.336563 0.400604 -0.990191 -0.919007 -0.665610 2.366336 0.024474 -1.185543 -0.303750 ... 1.328234 -0.861617 1.652815 -1.594467 -0.437479 -0.487018 -0.540361 -0.385701 3.042435 -0.491615
2799 -0.535570 -0.844361 0.400604 1.009906 0.151295 -0.665610 1.812224 -0.980321 0.733625 0.372430 ... 1.328234 0.713277 0.249661 -1.018610 -0.437479 -0.487018 1.850616 -0.385701 -0.328684 -0.491615

2800 rows × 26 columns

Register the model in AML

In [32]:
lr_reg_id = register_model("fairness_employeeturover_logistic_regression", unmitigated_predictor)
Registering  fairness_employeeturover_logistic_regression
Registering model fairness_employeeturover_logistic_regression
Registered  fairness_employeeturover_logistic_regression:3

Plot the confusion matrix to visualize the prediction accuracy

In [43]:
from sklearn.metrics import confusion_matrix
import numpy as np
import itertools

cf =confusion_matrix(Y_test,y_pred)
plt.imshow(cf,cmap=plt.cm.Blues,interpolation='nearest')
plt.colorbar()
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
class_labels = ['False','True']
tick_marks = np.arange(len(class_labels))
plt.xticks(tick_marks,class_labels)
plt.yticks([-0.5,0,1,1.5],['','False','True',''])
# plotting text value inside cells
thresh = cf.max() / 2.
for i,j in itertools.product(range(cf.shape[0]),range(cf.shape[1])):
    plt.text(j,i,format(cf[i,j],'d'),horizontalalignment='center',color='white' if cf[i,j] >thresh else 'black')
plt.show()

The model achieves a good accuracy of 86%.
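Accuracy can be read off the confusion matrix directly: the diagonal holds the correct predictions. A minimal sketch with an illustrative 2x2 matrix (not the actual notebook counts):

```python
import numpy as np

# Illustrative confusion matrix in sklearn layout [[TN, FP], [FN, TP]]
cf = np.array([[1400,  200],
               [ 192, 1008]])

# Accuracy = correct predictions (diagonal) / total predictions
accuracy = np.trace(cf) / cf.sum()
print(f"Accuracy: {accuracy:.0%}")  # prints: Accuracy: 86%
```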

Understanding Bias

We will use Fairlearn to understand whether there is any bias in the model. For now we will use the protected classes 'Race (code)' and 'Years of Service' to understand bias. Bias could be introduced into the model by various factors in the underlying data. For example, a particular race code could be under-represented, which could adversely affect results on real-world datasets by incorrectly predicting whether employees with that race code will leave.

Configure Fairlearn

Fairlearn comes with various algorithms that can detect bias and mitigate it using certain parity constraints. We will use GridSearch on Logistic Regression with the DemographicParity constraint. DemographicParity works well for binary classification with features like race code.
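The quantity DemographicParity constrains can be illustrated without Fairlearn: it compares selection rates (the fraction predicted as leaving) across the groups of a sensitive feature, and the disparity is the gap between the highest and lowest group rate. A minimal sketch on toy values (not the customer data):

```python
import numpy as np
import pandas as pd

# Toy predictions and a toy sensitive feature standing in for 'Race (code)'
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])
race   = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Selection rate per group: mean of the binary predictions within each group
rates = pd.Series(y_pred).groupby(race).mean()

# Demographic parity difference: gap between highest and lowest group rate
disparity = rates.max() - rates.min()
```

A model satisfies demographic parity when this disparity is (close to) zero; the customer's 5% threshold corresponds to requiring disparity <= 0.05.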

Visualize bias in the unmitigated model

We will create an experiment and upload the fairness metrics for the unmitigated model.

In [33]:
sf = { 'Race': A_test['Race (code)'], 'YearsofService': A_test['Years of Service']}
ys_pred = { lr_reg_id:unmitigated_predictor.predict(X_test) }
from fairlearn.metrics._group_metric_set import _create_group_metric_set

dash_dict = _create_group_metric_set(y_true=Y_test,
                                    predictions=ys_pred,
                                    sensitive_features=sf,
                                    prediction_type='binary_classification')
In [34]:
from azureml.contrib.fairness import upload_dashboard_dictionary, download_dashboard_by_upload_id

exp = Experiment(ws, "EmployeeTurnover_Fairness_Unmitigated_Model")
print(exp)

run = exp.start_logging()

# Upload the dashboard to Azure Machine Learning
try:
    dashboard_title = "Fairness insights of Unmitigated Logistic Regression Classifier for EmployeeTurnover"
    # Set validate_model_ids parameter of upload_dashboard_dictionary to False if you have not registered your model(s)
    upload_id = upload_dashboard_dictionary(run,
                                            dash_dict,
                                            dashboard_name=dashboard_title)
    print("\nUploaded to id: {0}\n".format(upload_id))

    # To test the dashboard, you can download it back and ensure it contains the right information
    downloaded_dict = download_dashboard_by_upload_id(run, upload_id)
finally:
    run.complete()
Experiment(Name: EmployeeTurnover_Fairness_Unmitigated_Model,
Workspace: ml-service)

Uploaded to id: 53c2a7ed-91e7-42d5-b3d4-ded2e9a49c5a

Visualize Fairness in Unmitigated model

We can see that there is a disparity of 28% in the Race feature, which the Fairlearn algorithm has detected. We can also see that the race with code 6 has the maximum disparity of 89.5%. We will now mitigate the bias using GridSearch.

In [16]:
from fairlearn.reductions import GridSearch, DemographicParity, ErrorRate
from sklearn.preprocessing import LabelEncoder, StandardScaler
sweep = GridSearch(LogisticRegression(solver='liblinear', fit_intercept=True),
                   constraints=DemographicParity(),
                   grid_size=71)

Train GridSearch with training data and Race(Code) feature

In [17]:
sweep.fit(X_train, Y_train,
          sensitive_features=A_train['Race (code)'])

predictors = sweep._predictors
Note: the fit logs a warning from Fairlearn that the grid has 6 dimensions, more than the recommended 4, since a prohibitively large grid size would otherwise be required to explore the space thoroughly; for such cases ExponentiatedGradient from the fairlearn.reductions module is suggested.

Apply the mitigation using parity constraints, run the predictions, and compute the error and disparity for each predictor

In [18]:
errors, disparities = [], []
for m in predictors:
    classifier = lambda X: m.predict(X)

    # Measure each grid-search predictor's error rate and
    # demographic-parity disparity on the training data
    error = ErrorRate()
    error.load_data(X_train, pd.Series(Y_train), sensitive_features=A_train['Race (code)'])
    disparity = DemographicParity()
    disparity.load_data(X_train, pd.Series(Y_train), sensitive_features=A_train['Race (code)'])

    errors.append(error.gamma(classifier)[0])
    disparities.append(disparity.gamma(classifier).max())

all_results = pd.DataFrame({"predictor": predictors, "error": errors, "disparity": disparities})

# Keep only the dominant models: those with the lowest error among all
# models of equal or lower disparity (the error-disparity Pareto frontier)
dominant_models_dict = dict()
base_name_format = "employeeturnover_gs_model_{0}"
for row_id, row in enumerate(all_results.itertuples()):
    model_name = base_name_format.format(row_id)
    errors_for_lower_or_eq_disparity = all_results["error"][all_results["disparity"] <= row.disparity]
    if row.error <= errors_for_lower_or_eq_disparity.min():
        dominant_models_dict[model_name] = row.predictor
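The loop above keeps only models on the error-disparity Pareto frontier: a model survives if no other model achieves equal-or-lower disparity with strictly lower error. A minimal, self-contained sketch of that filter on toy numbers (the predictor names and error/disparity values below are hypothetical, pandas only):

```python
import pandas as pd

# Toy (error, disparity) pairs for five hypothetical predictors
toy = pd.DataFrame({
    "predictor": ["m0", "m1", "m2", "m3", "m4"],
    "error":     [0.10, 0.12, 0.08, 0.15, 0.09],
    "disparity": [0.30, 0.10, 0.40, 0.05, 0.20],
})

dominant = []
for row in toy.itertuples():
    # Errors of all models whose disparity is no worse than this one's
    rivals = toy["error"][toy["disparity"] <= row.disparity]
    # Keep the model if it has the lowest error among those rivals
    if row.error <= rivals.min():
        dominant.append(row.predictor)
```

Here `m0` (error 0.10, disparity 0.30) is dropped because `m4` (0.09, 0.20) beats it on both axes; the other four each trade error for disparity and stay on the frontier.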
In [19]:
predictions_dominant = {"employeeturnover_unmitigated": unmitigated_predictor.predict(X_test)}
models_dominant = {"employeeturnover_unmitigated": unmitigated_predictor}
for name, predictor in dominant_models_dict.items():
    value = predictor.predict(X_test)
    predictions_dominant[name] = value
    models_dominant[name] = predictor

Register the models in Azure Machine Learning Service

In [21]:
from azureml.core import Workspace, Experiment, Model
import joblib
import os


os.makedirs('models', exist_ok=True)
def register_model(name, model):
    print("Registering ", name)
    model_path = "models/{0}.pkl".format(name)
    joblib.dump(value=model, filename=model_path)
    registered_model = Model.register(model_path=model_path,
                                      model_name=name,
                                      workspace=ws)
    print("Registered ", registered_model.id)
    return registered_model.id

model_name_id_mapping = dict()
for name, model in models_dominant.items():
    m_id = register_model(name, model)
    model_name_id_mapping[name] = m_id
Registering  employeeturnover_unmitigated
Registering model employeeturnover_unmitigated
Registered  employeeturnover_unmitigated:2
Registering  employeeturnover_gs_model_3
Registering model employeeturnover_gs_model_3
Registered  employeeturnover_gs_model_3:2
Registering  employeeturnover_gs_model_13
Registering model employeeturnover_gs_model_13
Registered  employeeturnover_gs_model_13:2
Registering  employeeturnover_gs_model_23
Registering model employeeturnover_gs_model_23
Registered  employeeturnover_gs_model_23:2
Registering  employeeturnover_gs_model_26
Registering model employeeturnover_gs_model_26
Registered  employeeturnover_gs_model_26:2
Registering  employeeturnover_gs_model_42
Registering model employeeturnover_gs_model_42
Registered  employeeturnover_gs_model_42:2
Registering  employeeturnover_gs_model_45
Registering model employeeturnover_gs_model_45
Registered  employeeturnover_gs_model_45:2
Registering  employeeturnover_gs_model_67
Registering model employeeturnover_gs_model_67
Registered  employeeturnover_gs_model_67:2
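Under the hood, `register_model` serializes each predictor to a `.pkl` file with `joblib.dump` before uploading it; the registered artifact is simply a pickle. A dependency-free sketch of the same save/load round trip, using the standard library `pickle` as a stand-in for `joblib` and a trivial placeholder model (not the actual turnover predictors):

```python
import os
import pickle
import tempfile

class DummyModel:
    """Placeholder standing in for a trained predictor."""
    def predict(self, X):
        return [1 for _ in X]

model = DummyModel()
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "employeeturnover_demo.pkl")
    # Save (what joblib.dump does inside register_model)
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    # Load back, as a scoring service would when serving the registered model
    with open(model_path, "rb") as f:
        restored = pickle.load(f)
    preds = restored.predict([[0], [1], [2]])
```

The restored object behaves identically to the original, which is why a registered model can be deployed on any compute that has compatible library versions.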

Re-key the prediction dictionary by registered model ID

In [22]:
predictions_dominant_ids = dict()
for name, y_pred in predictions_dominant.items():
    predictions_dominant_ids[model_name_id_mapping[name]] = y_pred

Upload the Fairlearn metrics to Azure Machine Learning using an experiment; once uploaded, the metrics can be visualized in a fairness dashboard in AML.

First, configure the Fairlearn metrics.

In [23]:
sf = { 'Race': A_test['Race (code)'], 'YearsofService': A_test['Years of Service']}

from fairlearn.metrics._group_metric_set import _create_group_metric_set


dash_dict = _create_group_metric_set(y_true=Y_test,
                                     predictions=predictions_dominant_ids,
                                     sensitive_features=sf,
                                     prediction_type='binary_classification')
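The DemographicParity constraint used throughout compares selection rates (the fraction predicted positive) across sensitive-feature groups, and the disparity shown in the dashboard is essentially the gap between the highest and lowest group rate. A toy version of that computation with made-up predictions and two hypothetical groups (NumPy only):

```python
import numpy as np

# Hypothetical binary predictions and a sensitive feature with two groups
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])
group  = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

# Selection rate per group: P(y_pred = 1 | group)
rates = {g: y_pred[group == g].mean() for g in np.unique(group)}

# Demographic parity difference: max gap between group selection rates
dp_difference = max(rates.values()) - min(rates.values())
```

In these toy numbers group A is selected at 0.75 and group B at 0.25, so the disparity is 0.5; the customer's 5% outcome-disparity requirement corresponds to demanding `dp_difference <= 0.05`.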

Upload metrics to AML by creating a new experiment

In [27]:
from azureml.contrib.fairness import upload_dashboard_dictionary, download_dashboard_by_upload_id
exp = Experiment(ws, "Fairlearn_GridSearch_EmployeeTurnover_1")
print(exp)

run = exp.start_logging()
try:
    dashboard_title = "Dominant Models from GridSearch"
    upload_id = upload_dashboard_dictionary(run,
                                            dash_dict,
                                            dashboard_name=dashboard_title)
    print("\nUploaded to id: {0}\n".format(upload_id))

    downloaded_dict = download_dashboard_by_upload_id(run, upload_id)
finally:
    run.complete()
Experiment(Name: Fairlearn_GridSearch_EmployeeTurnover_1,
Workspace: ml-service)

Uploaded to id: f9dd1977-f09d-43c2-a2f5-efaf28ba2b7c

Visualizing Fairlearn metrics for the mitigated models

Model Explanations

We will use the TabularExplainer to visualize local feature importances for every observation.

Initialize the TabularExplainer with the training data and the prediction model.

We will use the unmitigated model to understand and visualize the explanations.

In [29]:
from interpret.ext.blackbox import TabularExplainer
from azureml.interpret import ExplanationClient
from interpret_community.widget import ExplanationDashboard
# Explain predictions on your local machine
tabular_explainer = TabularExplainer(unmitigated_predictor, X_train, features=X_train.columns.to_numpy())

#client = ExplanationClient.from_run(run)

# Explain overall model predictions (global explanation)
# Passing in test dataset for evaluation examples - note it must be a representative sample of the original data
# x_train can be passed as well; with more examples, explanations
# take longer to compute but may be more accurate
global_explanation = tabular_explainer.explain_global(X_test)

# Uploading model explanation data for storage or visualization in webUX
# The explanation can then be downloaded on any compute
comment = 'Global explanation on the classification model trained on the EmployeeTurnover dataset'
#client.upload_model_explanation(global_explanation, comment=comment, model_id=original_model.id)
Could not import LIME, required for LIMEExplainer
Setting feature_perturbation = "tree_path_dependent" because no background data was given.
The option feature_dependence has been renamed to feature_perturbation!
The option feature_perturbation="independent" is has been renamed to feature_perturbation="interventional"!

Visualize the explanation dashboard

In [30]:
ExplanationDashboard(global_explanation, unmitigated_predictor, datasetX=X_test)
Out[30]:
<interpret_community.widget.explanation_dashboard.ExplanationDashboard at 0x7fb340069cf8>

Interpreting the features using Explanation Dashboard

Using the data explorer built into the Explanation dashboard, we can visualize feature importance of both overall model performance and individual data points.

Visualize Feature importance of individual data points.

As we can see in the visual, Current LinkedIn Activity impacts the 'Predicted Y' axis for both classes: Class 0 (employee did not leave) and Class 1 (employee left)
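Global importances like those shown in the dashboard can also be approximated model-agnostically by permutation: shuffle one feature and measure how much accuracy drops. A toy sketch with made-up data and a hypothetical rule-based "model" (not the notebook's actual predictor), NumPy only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: column 0 drives the label, column 1 is pure noise
X = rng.integers(0, 2, size=(200, 2)).astype(float)
y = X[:, 0]  # the target depends only on feature 0

def model_predict(X):
    # Hypothetical predictor that thresholds feature 0
    return (X[:, 0] > 0.5).astype(float)

baseline = (model_predict(X) == y).mean()  # 1.0 by construction

importances = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])   # break feature j's relationship
    acc = (model_predict(Xp) == y).mean()
    importances.append(baseline - acc)     # accuracy drop = importance
```

Permuting the informative column causes a large accuracy drop, while permuting the noise column changes nothing, mirroring how a dominant feature such as Current LinkedIn Activity stands out in the dashboard.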

Conclusion

Let us now review whether we were able to address the customer scenario.

  • Customer would like to understand who is likely to leave company and why

We created a prediction model using logistic regression that predicts which employees are likely to leave. Using the explanations, we learned that Current LinkedIn Activity Code is the most influential feature and therefore the most likely driver of attrition. We also examined potential bias in the model using Fairlearn's bias mitigation techniques.

  • Activities they can engage in to start decreasing the volume of employees that leave

  • Customer business leaders would like to understand causality, if at all possible, especially which variables they should be looking at and whether there are more that would be worthwhile to obtain for future attempts.

Now that we understand Current LinkedIn Activity Code is the leading indicator, the customer can investigate this feature further and chalk out the next best actions.
